perm filename INDEX.ME[UP,DOC]2 blob sn#748786 filedate 1984-04-03 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00005 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00002 00002	KEYWORD INDEXING, RETRIEVAL & DISPLAYING OF TEXT FROM A USER'S FILES
C00006 00003	INDEXING ARTICLES
C00010 00004	RETRIEVAL FROM A USER INDEXED TEXT SYSTEM
C00015 00005	FORMAT OF THE .TXT FILE
C00018 ENDMK
C⊗;
KEYWORD INDEXING, RETRIEVAL & DISPLAYING OF TEXT FROM A USER'S FILES

The NS system can be used to index a user's files and to retrieve
text from them by keyword.  The system program INDEX builds the
initial index of a user's text and is also used to add and index
further text.  This writeup describes that use of INDEX and how to
use NS to retrieve the indexed text.  Please MAIL to ME any comments,
suggestions or reports of bugs concerning the use of INDEX and NS.

Retrieval of text is by portions called articles.  An article of a
user's text is comparable to a news service story, although there is
no provision for linking separate articles together the way stories
sometimes are.  The size of an individual article is limited to about
1.5K machine words (about 7500 characters, or roughly 100 lines of 75
characters each).  The length is also limited to a maximum of 275
lines.  The maximum text line length is 83 characters; longer lines
would not be presentable on Data Disc displays.  Articles or lines
longer than the maximum permitted lengths will be broken in two (with
a warning issued to the user).
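
The limits above can be restated as a small check.  This is only a
sketch in Python, not INDEX's own code: the function name and the
warning wording are invented for illustration, and the 7500-character
figure is the one given later in this writeup for long stories.

```python
MAX_CHARS = 7500        # character limit quoted later in this writeup
MAX_LINES = 275         # maximum lines per article
MAX_LINE_LENGTH = 83    # maximum characters per text line

def limit_warnings(article):
    """Return warnings for one article whose lines are separated by CRLF.

    Hypothetical helper: INDEX's real messages are not documented here.
    """
    warnings = []
    lines = article.split("\r\n")
    if len(article) > MAX_CHARS:
        warnings.append("article exceeds %d characters" % MAX_CHARS)
    if len(lines) > MAX_LINES:
        warnings.append("article exceeds %d lines" % MAX_LINES)
    for number, line in enumerate(lines, start=1):
        if len(line) > MAX_LINE_LENGTH:
            warnings.append(
                "line %d exceeds %d characters" % (number, MAX_LINE_LENGTH))
    return warnings
```

An article that produces no warnings here would go into the index
unbroken; how INDEX chooses the point at which to break an oversized
article is not stated in this writeup.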

Indexing is done using every word as a keyword except for about 100
very common words (such as "the", "and", "but", etc.) which are
ignored.  Before the indexing is done, certain suffixes are removed. 
The suffixes currently removed are: -S, -ES, -ING, -ED, -LY. 
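
As an illustration of the keyword rule, here is a minimal Python
sketch.  Several things in it are assumptions: only three of the
roughly 100 ignored words are known from this writeup, the order in
which suffixes are tried (longer first) is a guess, and so is the
choice to test for common words before stripping suffixes.

```python
import re

# Only these three ignored words are named in the writeup; the real
# list has about 100 entries.
STOP_WORDS = {"THE", "AND", "BUT"}

# Suffixes named in the writeup.  Trying -ES before -S is an assumption.
SUFFIXES = ("ING", "ES", "ED", "LY", "S")

def strip_suffix(word):
    """Remove at most one listed suffix, never leaving an empty word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

def keywords(text):
    """Return the index keywords of text, in order of appearance."""
    words = re.findall(r"[A-Za-z]+", text.upper())
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]
```

With this sketch, keywords("The indexing and retrieval") yields
INDEX and RETRIEVAL.  Such crude stripping also turns FILES into FIL;
whether INDEX behaves the same way is not stated here.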


FILE USAGE

Each user can have any number of independently indexed text systems.
A particular text system is identified by a primary file name and a
PPN.  The stored text and the retrieval data for a particular system
are stored in two files with the given primary name and PPN and the
respective extensions .TXT and .DAT.  These files MUST NOT BE EDITED
OR OTHERWISE CHANGED except with the program INDEX.  The .DAT file
contains absolute pointers into the .TXT file so that retrieval can
make use of the random access provisions of our disk system, and any
editing of the .TXT file will invalidate these pointers.  If it is
necessary to edit individual articles, the .DAT file must be deleted
and the text must be re-indexed from scratch.  The procedure is
described below under EDITING ALREADY-INDEXED ARTICLES.
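
To see why editing the text invalidates the pointers, consider this
Python sketch.  The layout of the .DAT file is not described in this
writeup, so the (offset, length) pair below is a hypothetical stand-in
for whatever INDEX actually stores.

```python
import io

def read_at(txt_file, offset, length):
    """Random-access read: seek to an absolute offset and read length bytes."""
    txt_file.seek(offset)
    return txt_file.read(length)

original = b"FIRST ARTICLE\r\n\x0cSECOND ARTICLE\r\n"

# Hypothetical index entry: absolute position and size of the second article.
offset = original.index(b"SECOND")
length = len(b"SECOND ARTICLE")

# Against the unedited text, the pointer finds the article.
found = read_at(io.BytesIO(original), offset, length)

# After an edit inserts four bytes earlier in the file, the same
# pointer lands in the wrong place.
edited = original.replace(b"FIRST", b"THE FIRST")
stale = read_at(io.BytesIO(edited), offset, length)
```

Here found is the second article, while stale begins with the tail of
the first one.  Absolute pointers make retrieval a single seek, at the
cost of forbidding edits to the file they point into.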

Text to be indexed into the retrieval system is read by INDEX from
another file with the same primary name and PPN as the .TXT and .DAT
files but with the extension .TFL.  The .TFL file is considered a
temporary file and is deleted by INDEX after it has been processed.
The format of the .TFL file is described below. 


INDEXING ARTICLES

To index new articles (whether or not there are any old ones), type
the monitor command R INDEX.  INDEX will then ask for the filename of
the information system to be used.  You should type the primary file
name (with no extension); PPN is optional--your current alias is used
if no explicit PPN is given.  INDEX will then look for the .TFL file
with the given name and complain if it is not found.  If there is no
old .TXT or .DAT file with the given name, new ones will be created. 
If only one of the .TXT and .DAT files exists, INDEX will again
complain.  If everything is okay, INDEX will 1) add the articles from
the .TFL file to the end of the .TXT file, 2) update the .DAT file,
and then 3) delete the .TFL file.  To allow you to monitor INDEX's
progress, the first few characters of each article are typed out as
the article is processed.  Carriage return/linefeed pairs in these
first few characters are typed out as dashes (-'s), and a couple of
spaces are typed between successive articles' first few characters. 
Typical running time for INDEX is about a minute of CPU time for 30K
to 40K of text, or about one second for about 50 lines of text. 
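
The file checks and the progress display described above can be
summarized in a short Python sketch.  The function names, the result
strings and the preview length of 20 characters are all assumptions
made for illustration; INDEX's actual messages and its idea of "the
first few characters" are not given in this writeup.

```python
def plan_index_run(has_tfl, has_txt, has_dat):
    """Decide what an INDEX run would do, given which files exist."""
    if not has_tfl:
        return "complain: .TFL file not found"
    if has_txt != has_dat:
        return "complain: only one of .TXT and .DAT exists"
    if not has_txt:
        return "create .TXT and .DAT, index articles, delete .TFL"
    return "append to .TXT, update .DAT, delete .TFL"

def progress_preview(article, count=20):
    """Show the start of an article with each CRLF pair typed as a dash.

    Replacing before truncating is an assumption; it avoids leaving
    half of a CRLF pair at the end of the preview.
    """
    return article.replace("\r\n", "-")[:count]

def progress_line(articles, count=20):
    """Join the previews of successive articles with a couple of spaces."""
    return "  ".join(progress_preview(a, count) for a in articles)
```

For example, progress_preview("Hello\r\nworld, this is long text", 10)
gives "Hello-worl".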


FORMAT OF THE .TFL FILE--FORMFEEDS AND ALTERNATE DELIMITERS

The temporary .TFL file should contain each new article on a separate
page.  Any E directory page at the front of a .TFL file will be
ignored, but SOS format files are NOT permitted.  Leading and extra
trailing carriage returns and linefeeds in an article are ignored.
If you wish to INDEX a .TFL file that uses some character other than
formfeed to delimit articles, then end the file name typed to INDEX
with an ALTMODE instead of a carriage return; INDEX will then ask you
for the alternate delimiter.  This allows you, for instance, to INDEX
saved message files, in which the character partial-sign (∂)
separates messages; or, by typing carriage return as the alternate
delimiter, you can specify that blank lines delimit articles in the
.TFL file.  The alternate delimiter will be recognized as an article
delimiter only when occurring at the beginning of a line, although it
can occur either at the beginning or at the end of each article.
When an alternate delimiter is being used, formfeeds in the .TFL file
will be ignored.  Whatever the delimiter in the .TFL file, INDEX will
use formfeed as the article delimiter in the .TXT file.
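
The delimiter rules can be sketched in Python as follows.  This is an
illustration under stated assumptions, not INDEX's algorithm: lines
are taken to end in CRLF, ignored formfeeds are simply dropped, the
delimiter character itself is removed from the article, and empty
articles are discarded.

```python
def split_articles(text, delimiter="\f"):
    """Split the contents of a .TFL file into articles.

    With the default formfeed, every formfeed ends an article.  With
    an alternate delimiter, it counts only at the beginning of a line,
    and formfeeds are ignored.  A delimiter of "\r" means that blank
    lines separate articles.
    """
    if delimiter == "\f":
        pieces = text.split("\f")
    else:
        text = text.replace("\f", "")   # assumption: ignored means dropped
        pieces, current = [], []
        for line in text.split("\r\n"):
            if delimiter == "\r":
                at_start = line == ""   # a line that begins with CR is blank
            else:
                at_start = line.startswith(delimiter)
            if at_start:
                pieces.append("\r\n".join(current))
                current = [] if delimiter == "\r" else [line[len(delimiter):]]
            else:
                current.append(line)
        pieces.append("\r\n".join(current))
    # Leading and trailing carriage returns and linefeeds are ignored.
    articles = [piece.strip("\r\n") for piece in pieces]
    return [article for article in articles if article]
```

Because the delimiter may come either before or after each article,
the sketch treats it purely as a separator and throws away any empty
piece that results.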


RETRIEVAL FROM A USER INDEXED TEXT SYSTEM

To retrieve articles from an INDEXed file, use NS (started by monitor
command NS) and select the file by typing the NS command character
equal sign (=) followed by the name of the file and a carriage
return, e.g., =NOTICE[UP,DOC]<cr>.  This automatically puts you into
/AGAIN mode, though you can then change to /-AGAIN mode if you so
desire.  While you have a user file selected, all the usual features
of NS are available except that the time/date range has no effect.
To find out the name of the user file you currently have selected,
type =?<cr>.  To deselect a user file (and thus reselect the news),
type just =<cr>.  To have NS display this information on retrieving
articles from your own INDEXed files, type the NS command ?INDEX<cr>.


NS INDEXED FILE COMMAND SUMMARY

COMMAND		MEANING

=<filename>	Evaluate subsequent keyword expressions using the
		indexed text system named by <filename>.
=		Evaluate subsequent keyword expressions using the
		wire news.
=?		Type out the name of the current text source.
?INDEX		Give some help on retrieval of text from user files.
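
For readers who find code easier to scan than a table, the four
commands above can be told apart as in this Python sketch.  The
function and the action names it returns are hypothetical; NS's own
command parsing is not described in this writeup.

```python
def parse_ns_command(line):
    """Classify one typed line (without its carriage return).

    Returns an (action, argument) pair.  Only the four commands in the
    summary are recognized; anything else is passed through as 'other'.
    """
    if line == "=":
        return ("use-news", None)
    if line == "=?":
        return ("show-source", None)
    if line == "?INDEX":
        return ("help-index", None)
    if line.startswith("="):
        return ("use-file", line[1:])
    return ("other", line)
```

For example, parse_ns_command("=NOTICE[UP,DOC]") gives the pair
("use-file", "NOTICE[UP,DOC]").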


FORMAT OF THE .TXT FILE

For convenience, the articles in the .TXT file are separated by
formfeeds so that the third article, say, will appear on page three
of the .TXT file.  To be precise, the end of each article is marked
by one to five null bytes to fill out a word and then a word
containing a formfeed followed by four nulls.  Tabs appearing in a
.TFL file are converted into the appropriate number of spaces before
going into the .TXT file.  Other minor changes to the text are made
where necessary to conform to the NS text structure: bare LFs and
bare CRs are changed into CRLFs; blank lines are made to contain a
single space character; long lines (more than 83 characters) are
broken by insertion of a hyphen (-) and a CRLF; and long articles
(more than 7500 characters or more than 275 lines) are broken in two.
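
The conversions above can be sketched in Python.  Several details are
assumptions rather than documented behavior: tab stops every 8
columns, the exact column at which a long line is broken (chosen here
so that the broken piece, hyphen included, is 83 characters), and a
single CRLF kept at the end of each article.  Five characters per
word follows from the one-to-five-null padding rule above.

```python
import re

MAX_LINE = 83        # maximum line length, from the writeup
CHARS_PER_WORD = 5   # five characters packed into each machine word

def break_long_line(line):
    """Break a line longer than MAX_LINE by inserting hyphens."""
    pieces = []
    while len(line) > MAX_LINE:
        pieces.append(line[:MAX_LINE - 1] + "-")
        line = line[MAX_LINE - 1:]
    pieces.append(line)
    return pieces

def normalize_article(text):
    """Apply the .TFL to .TXT text conversions to one article."""
    # Bare CRs and bare LFs become line ends; outer line ends are dropped.
    text = re.sub(r"\r\n|\r|\n", "\n", text).strip("\n")
    out = []
    for line in text.split("\n"):
        line = line.expandtabs(8)       # assumption: 8-column tab stops
        if line == "":
            out.append(" ")             # blank lines hold a single space
        else:
            out.extend(break_long_line(line))
    return "\r\n".join(out) + "\r\n"

def terminate(article):
    """Append the end-of-article marker.

    One to five nulls fill out the current word, then comes a word
    holding a formfeed followed by four nulls.
    """
    pad = CHARS_PER_WORD - len(article) % CHARS_PER_WORD
    return article + "\0" * pad + "\f" + "\0" * 4
```

A terminated article always occupies a whole number of words, so the
formfeed that starts the next page sits at the beginning of a word.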


EDITING ALREADY-INDEXED ARTICLES

The .TXT file must not be edited or otherwise changed except with
the program INDEX.  Therefore, if it is necessary to edit articles
already indexed or to rearrange such articles, the following
procedure should be followed. 

1) Delete the .DAT file (which contains the indexing information).
2) Rename the .TXT file to .TFL.
3) Edit the .TFL file as desired.  Each old article will appear on a
   separate page in exactly the format required of .TFL files by
   INDEX.  Remember that INDEX does not accept SOS files as input. 
   If you must edit with SOS, COPY over the .TFL file with the /N
   switch to eliminate line numbers. 
4) Run INDEX (R INDEX) on the edited .TFL file.  (Be sure to have done 1 above.)
   This will create the .TXT and .DAT files and delete the .TFL file.
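
Steps 1 and 2 above are ordinary file operations.  As a rough modern
analogy only (the original procedure uses monitor commands, and PPNs
have no counterpart here), they might look like this in Python.  The
function name is hypothetical, and the refusal to overwrite an
existing .TFL file is an added precaution, not part of the procedure.

```python
from pathlib import Path

def prepare_for_editing(directory, name):
    """Carry out steps 1 and 2 for the text system called name.

    Deletes name.DAT, renames name.TXT to name.TFL, and returns the
    path of the .TFL file, ready for editing and re-indexing.
    """
    base = Path(directory)
    dat = base / (name + ".DAT")
    txt = base / (name + ".TXT")
    tfl = base / (name + ".TFL")
    if tfl.exists():
        raise FileExistsError("unprocessed .TFL file already present")
    dat.unlink(missing_ok=True)   # step 1: discard the indexing information
    txt.rename(tfl)               # step 2: the old text becomes new input
    return tfl
```

Deleting the .DAT file first means that no stale index is ever left
beside a .TXT file whose contents it no longer matches.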